Like with \(\hat{p}\), the difference of two sample proportions \(\hat{p}_1 - \hat{p}_2\) can be modeled using a normal distribution (when conditions are met).
\(SE_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}\)
This standard error comes from the fact that variances of independent variables add, even when subtracting.
Differences of Two Proportions: Standard Errors
When we talk about the spread of an estimate, we’re really talking about variance (the square of the standard error).
If two random variables A and B are independent, then:
\(\text{Var}(A - B) = \text{Var}(A) + \text{Var}(B)\)
This might seem counterintuitive — but remember:
Even if you’re subtracting two noisy measurements, the uncertainty (noise) from both still adds up.
Think of it like using two shaky rulers. Subtracting doesn’t cancel the shakiness — it just combines it!
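The shaky-rulers intuition is easy to check by simulation (a sketch; the means, SDs, and sample size are arbitrary choices):

```r
set.seed(1)
A <- rnorm(1e5, mean = 10, sd = 2)  # Var(A) = 4
B <- rnorm(1e5, mean = 4,  sd = 1)  # Var(B) = 1
# Subtracting the two noisy measurements still combines their noise:
var(A - B)  # close to Var(A) + Var(B) = 5
```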
Differences of Two Proportions: Simulation Setup
We’ll use simulation to understand the sampling distribution of sample proportions for two independent groups.
Group 1 has a true proportion \(p_1 = 0.5\)
Group 2 has a true proportion \(p_2 = 0.4\)
Each group has \(n = 500\) individuals per sample
We’ll repeat this sampling 1,000 times to observe variation in the sample proportions
Differences of Two Proportions: Simulation
set.seed(1)  # ensures reproducibility
B <- 1000    # number of simulations
n <- 500     # sample size per group
p1 <- 0.5    # true proportion in group 1
p2 <- 0.4    # true proportion in group 2

# Create empty vectors to store simulated means and SDs
mean_x1 <- mean_x2 <- numeric(B)
sd_x1 <- sd_x2 <- numeric(B)

# Loop to simulate samples for both groups
for (i in 1:B) {
  # Generate random binary outcomes for group 1 (success = 1, failure = 0)
  x1 <- rbinom(n, size = 1, prob = p1)
  mean_x1[i] <- mean(x1)  # sample proportion for group 1
  sd_x1[i] <- sd(x1)      # sample SD for group 1

  # Repeat for group 2
  x2 <- rbinom(n, size = 1, prob = p2)
  mean_x2[i] <- mean(x2)
  sd_x2[i] <- sd(x2)
}
Comparing Theoretical and Empirical Standard Errors
We can compare:
the theoretical standard error (from the formula)
the empirical standard error (from our simulations)
Comparing Theoretical and Empirical Standard Errors
# Theoretical standard error for a sample proportion
sqrt(p1 * (1 - p1) / n)
[1] 0.02236068
# Empirical standard error from simulated data
mean(sd_x1) / sqrt(n)
[1] 0.0223613
The first line gives the theoretical SE: \(\sqrt{p(1-p)/n}\)
The second line gives the empirical SE, based on simulated SDs
These values should be nearly identical, validating the normal approximation for large \(n\)
Sampling Distribution for One Group (p1 = 0.5)
We can visualize the distribution of sample proportions across simulations, and overlay a 95% confidence interval around the true mean.
Roughly 5% of simulated sample proportions should fall outside this interval — confirming the 95% confidence level’s interpretation.
# Calculate the proportion of estimates that fall outside the 95% CI
mean(ifelse(mean_x1 >= ub_x1 | mean_x1 <= lb_x1, 1, 0))
[1] 0.052
Sampling Distribution for One Group (p2 = 0.4)
Again, around 5% of simulated estimates will fall outside the interval.
The spread is slightly narrower than for Group 1 because the variance is smaller.
# Share of points outside the 95% CI
mean(ifelse(mean_x2 >= ub_x2 | mean_x2 <= lb_x2, 1, 0))
[1] 0.048
Combining Two Proportions
Now that we understand the sampling variation of each group separately, we can combine them just as we would when estimating a difference in proportions:
Consider an experiment involving patients who underwent cardiopulmonary resuscitation (CPR) following a heart attack and were subsequently admitted to a hospital.
Patients were randomly assigned to either a treatment group (received a blood thinner) or a control group (no blood thinner).
The outcome of interest was survival for at least 24 hours.
Differences of two proportions: Example 1
            Survived   Died   Total
Control           11     39      50
Treatment         14     26      40
Total             25     65      90
Differences of two proportions: Example 1
Create and interpret a 90% confidence interval of the difference for the survival rates in the CPR study.
sig_level z_score min max
1 0.01 2.450 -0.10396538 0.3639654
2 0.05 1.960 -0.05717230 0.3171723
3 0.10 1.645 -0.02709104 0.2870910
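The 90% row of the table above can be reproduced directly from the counts (a sketch; the variable names are my own):

```r
p_hat_c <- 11 / 50  # control survival rate
p_hat_t <- 14 / 40  # treatment survival rate
point_est <- p_hat_t - p_hat_c
se <- sqrt(p_hat_c * (1 - p_hat_c) / 50 + p_hat_t * (1 - p_hat_t) / 40)
# 90% CI uses z = qnorm(0.95), approximately 1.645
round(point_est + c(-1, 1) * qnorm(0.95) * se, 3)  # -0.027  0.287
```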
Differences of two proportions: Visualizing Confidence Intervals
Differences of two proportions: Interpretation
We are 90% confident that blood thinners change the 24-hour survival rate by between -3 and 29 percentage points for patients similar to those in the study.
Because 0% is within this range, the evidence is inconclusive — we cannot determine whether blood thinners help or harm heart attack patients who have undergone CPR.
Differences of Two Proportions: Example 2
A 5-year clinical trial evaluated whether fish oil supplements reduce the risk of heart attacks.
Each participant was randomly assigned to one of two groups:
Fish Oil group
Placebo group
We’ll examine heart attack outcomes across both groups.
Differences of Two Proportions: Example 2
Group      Heart Attack   No Event    Total
Fish Oil            145     12,788   12,933
Placebo             200     12,738   12,938
Differences of Two Proportions: Example 2
Construct a 95% confidence interval for the effect of fish oil on heart attack incidence among patients represented by this study.
Interpret the interval in context:
What does the direction and width of the interval suggest?
Is there evidence that fish oil has a meaningful effect on heart attack risk?
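From the counts above, the intervals can be computed directly (a sketch; variable names are my own), showing both the 95% interval requested and the 90% interval used in the interpretation that follows:

```r
p_hat_fish <- 145 / 12933
p_hat_plac <- 200 / 12938
point_est <- p_hat_fish - p_hat_plac
se <- sqrt(p_hat_fish * (1 - p_hat_fish) / 12933 +
           p_hat_plac * (1 - p_hat_plac) / 12938)
round(point_est + c(-1, 1) * qnorm(0.975) * se, 4)  # 95% CI
round(point_est + c(-1, 1) * qnorm(0.95)  * se, 4)  # 90% CI
```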
Differences of two proportions: Visualizing Confidence Intervals
Differences of two proportions: Interpretation
The point estimate for the effect of fish oil is approximately –0.0043,
meaning heart attacks occurred 0.43 percentage points less often in the fish-oil group than in the placebo group.
We are 90% confident that fish oil changes the heart-attack rate by between –0.66 and –0.19 percentage points for patients similar to those in the study.
Because this interval does not include 0, the reduction in heart-attack risk is statistically significant at the 10% (and even the 5% and 1%) level.
Practical vs. Statistical Significance
While statistically significant, the effect size is extremely small — roughly 0.4 fewer heart attacks per 100 individuals.
In a large clinical sample, even minor effects can reach significance if variability is low.
From a practical standpoint, such a small reduction may not justify the cost, side effects, or adherence burden of treatment.
More on Two-Proportion Hypothesis Tests
When conducting a two-proportion hypothesis test, the null hypothesis is typically: \(H_0: p_1 - p_2 = 0\)
However, there are cases where we may want to test for a specific difference other than zero.
For example, suppose we want to test whether: \(H_0: p_1 - p_2 = 0.10\)
In contexts like these, we use the sample proportions \(\hat{p}_1\) and \(\hat{p}_2\) to check the success–failure condition and to construct the standard error.
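For instance, a test of \(H_0: p_1 - p_2 = 0.10\) might look like the following sketch (the sample proportions and sizes are hypothetical, for illustration only):

```r
p1_hat <- 0.62; n1 <- 500  # hypothetical sample values
p2_hat <- 0.45; n2 <- 500
se <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
z <- ((p1_hat - p2_hat) - 0.10) / se  # center the test at 0.10, not 0
2 * (1 - pnorm(abs(z)))               # two-sided p-value
```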
Differences of Two Proportions: Example 3
A drone quadcopter company is considering a new manufacturer for rotor blades.
The new manufacturer is more expensive but claims that their higher-quality blades are 3% more reliable, meaning that 3% more blades pass inspection compared to the current supplier.
sig_level z_score min max
1 0.01 2.450 0.0009547225 0.05704528
2 0.05 1.960 0.0065637780 0.05143622
3 0.10 1.645 0.0101695994 0.04783040
Visualizing Confidence Intervals
Compute and Visualize the z-Statistic
z <- (point_est - 0.03) / se
set.seed(1)
sim <- rnorm(1000, mean = 0.03, sd = se)
# Probability of observing a value this extreme or larger
1 - mean(ifelse(point_est >= sim, 1, 0))
[1] 0.004
# p-value (right-tailed)
1 - pnorm(z)
[1] 0.005648044
Visualizing the Sampling Distribution
Example 3: Conclusion
From the standard normal distribution:
The right-tail area is approximately 0.004
Doubling for a two-tailed test gives p = 0.008
Since p = 0.008 < 0.05, we reject the null hypothesis
We find statistically significant evidence that the higher-quality blades’ pass rate exceeds the standard blades’ pass rate by more than 3 percentage points, exceeding the company’s claim.
Chi-Squared Distributions: Introduction
\(\chi\) = the Greek letter “chi” (pronounced like “kai”)
The \(\chi^2\) distribution is a continuous probability distribution that is widely used in statistical inference.
Closely related to the standard normal distribution
If a variable \(Z\) has the standard normal distribution, then \(Z^2\) has the \(\chi^2\) distribution with one degree of freedom
Chi-Squared Distributions: Histograms
Chi-Squared Distributions: Definition
If \(Z_1, Z_2,..., Z_k\) are independent standard normal variables, then…
\(Z_1^2 + Z_2^2 +...+ Z_k^2\)
…has a \(\chi^2\) distribution with \(k\) degrees of freedom.
Degrees of Freedom: Concept
A degree of freedom (df) represents the number of independent pieces of information available to estimate something.
Whenever we calculate a statistic, we “use up” some information.
For example, once we estimate the sample mean, one data point can be perfectly predicted from the others.
So for a sample of size \(n\), only \((n - 1)\) observations are free to vary when computing the sample variance.
if \(n = 4\), \(x_1 = 8\), \(x_2 = 10\), \(x_3 = 12\) and \(\bar{x} = 10\)…
… then \(x_4 = 10\)
So even though we had four data points, only three were free to vary — the fourth is determined by the mean.
That’s why when calculating the sample variance, we divide by \(n - 1\) instead of \(n\): one degree of freedom has been “used up” in estimating the mean.
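A quick numerical check using the four points above: dividing the squared deviations by \(n - 1\) reproduces R’s built-in var():

```r
x <- c(8, 10, 12, 10)
n <- length(x)
sum((x - mean(x))^2) / (n - 1)  # divide by n - 1, not n
var(x)                          # R's built-in does the same
```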
Degrees of Freedom: Motivation
Why it matters:
Degrees of freedom tell us how much independent information our test or estimate is based on.
They affect the shape of sampling distributions (like \(t\), \(F\), and \(\chi^2\)), which in turn changes the critical values and p-values we use.
More degrees of freedom → more information → the distribution becomes narrower and more normal-looking.
Big idea:
Degrees of freedom link sample size, uncertainty, and the reliability of inference — they remind us that every time we estimate something, we “spend” information.
Chi-Squared Distributions: Properties
mean: \(\mu = k\)
variance: \(\sigma^2 = 2k\)
mode occurs at \(k - 2\) (for \(k \geq 2\); otherwise at 0)
set.seed(1)
x <- rnorm(1000, mean = 0, sd = 1)
x2 <- x^2
mean(x2)
[1] 1.070115
var(x2)
[1] 2.291541
Chi-Squared Distributions: Multiple Degrees of Freedom
k <- 2
for (i in 1:k) {
  set.seed(i)
  x <- rnorm(1000, mean = 0, sd = 1)
  x2 <- x^2
  if (i == 1) {
    z2 <- x2
  } else {
    z2 <- z2 + x2
  }
}
mean(z2); k
[1] 2.103046
[1] 2
var(z2); 2*k
[1] 4.281053
[1] 4
Chi-Squared Distributions: Multiple Degrees of Freedom
k <- 20
for (i in 1:k) {
  set.seed(i)
  x <- rnorm(1000, mean = 0, sd = 1)
  x2 <- x^2
  if (i == 1) {
    z2 <- x2
  } else {
    z2 <- z2 + x2
  }
}
mean(z2); k
[1] 20.08944
[1] 20
var(z2); 2*k
[1] 40.06536
[1] 40
Chi-Squared Tests
In this section, we develop a method for assessing a null model when data are binned into categories.
This technique is commonly used in two settings:
To determine whether a sample is representative of a larger population across several groups.
To evaluate whether observed data follow a specified theoretical distribution (e.g., normal, geometric, Poisson).
Both scenarios rely on the same statistical tool — the chi-squared test — which allows us to examine all bins simultaneously, rather than testing one or two categories in isolation.
Chi-Squared Tests: Jurors Example
Consider data from a random sample of 275 jurors in a small county.
Each juror identified their racial group.
We would like to test whether this sample is racially representative of the eligible population of jurors (i.e., registered voters).
The p-value of 0.117 indicates that, if jurors were selected randomly according to the county’s racial composition, there is about an 11.7% chance of observing differences in juror counts at least as extreme as those in the sample.
X2_sim <- rowSums((Y - m)^2 / m)
p_value1 <- mean(X2_sim >= X2_obs)
p_value1
[1] 0.11668
Chi-Squared Tests: Visualizing the p-value
Chi-Squared Tests: Attempting a More Familiar Approach (Incorrect)
Chi-Squared Tests: Comparing Observed and Expected Rallies
Chi-Squared Tests: Simulating the Chi-Squared Distribution
for (i in 1:6) {
  set.seed(i)
  z <- rnorm(sum(x$observed), mean = 0, sd = 1)
  if (i == 1) {
    y <- z^2
  } else {
    y <- y + z^2
  }
}
mean(ifelse(x2 > y, 1, 0))
[1] 0.5198238
pchisq(x2, df = 6)
[1] 0.5175767
Chi-Squared Tests: Visualizing the p-Value
Intro to t-distributions
We rarely know the true population mean (\(\mu\)) — we estimate it from a sample.
When the population standard deviation (\(\sigma\)) is unknown (which is most of the time), we rely on the \(t\)-distribution instead of the normal model.
The \(t\)-test helps us quantify uncertainty and make inferences about population parameters using sample data.
t-distributions
t-distributions: Degrees of Freedom
Intuition of t-distributions
B <- 10000
set.seed(1)
x <- rnorm(B, mean = 0, sd = 1)
dx <- density(x)

df <- 3
set.seed(1)
t_vals <- replicate(B, {
  x <- rnorm(df + 1)                # sample size = df + 1 = 4
  mean(x) / (sd(x) / sqrt(df + 1))  # t-statistic
})
dt <- density(t_vals)
Using the t-Distribution
We use a t-distribution with \(df = n - 1\) to model the sample mean when the sample size is \(n\).
As the number of observations increases, the degrees of freedom (df) also increase.
With larger df, the t-distribution more closely resembles the standard normal distribution.
When \(df \approx 30\) or greater, the t-distribution is nearly indistinguishable from the normal model.
In general, the degrees of freedom describe the shape of the t-distribution — larger df means the distribution becomes narrower and more normal-like.
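One quick way to see this convergence is to compare t critical values against the normal critical value as the degrees of freedom grow:

```r
# 97.5th percentile: the t critical value shrinks toward qnorm(0.975) = 1.96
round(qt(0.975, df = c(5, 10, 30, 100)), 3)  # 2.571 2.228 2.042 1.984
round(qnorm(0.975), 3)                       # 1.96
```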
Example: Mercury in Dolphins
Let’s apply the t-distribution in a real-world example involving mercury content in dolphin muscle.
Elevated mercury concentrations are a serious concern for both dolphins and humans who consume them.
We’ll construct a confidence interval for the average mercury concentration using a sample of 19 Risso’s dolphins from the Taiji area in Japan.
B <- 100000
n <- 19
point_est <- 4.4
s <- 2.3
se <- x_int * s / sqrt(n)  # x_int: critical value computed earlier (not shown in this snippet)
upper_est <- point_est + se
lower_est <- point_est - se
round(c(lower_est, upper_est), 2)
[1] 3.3 5.5
Example 2: Are Runners Getting Faster?
Is the typical U.S. runner getting faster or slower over time?
We’ll explore this question using data from the Cherry Blossom 10-Mile Run, held each spring in Washington, D.C.
In 2006, the average finish time for all runners was 93.29 minutes (about 93 minutes and 17 seconds).
We’ll use a sample of 100 participants from the 2017 race to test whether the average finish time has changed.
In other words, are runners getting faster or slower, or has there been no meaningful change?
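In R, the test itself is one call to t.test(). The raw 2017 sample is not reproduced here, so the data below are a simulated placeholder (the mean and SD are made up), not the actual race results:

```r
set.seed(1)
run17 <- rnorm(100, mean = 98.8, sd = 17)  # placeholder data, NOT the real 2017 sample
tt <- t.test(run17, mu = 93.29)            # two-sided test against the 2006 mean
tt
```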
Because the p-value is smaller than 0.05, we reject the null hypothesis.
The data provide strong evidence that the average run time in 2017 differs from the 2006 average.
Since the observed mean time is higher than the 2006 mean, and we’ve rejected the null, we conclude that: Runners were slower on average in 2017 than in 2006.
Difference of Two Means
In this section, we consider the difference between two population means, \(\mu_1 - \mu_2\), when the data are not paired.
As with a single sample, we must check certain conditions to ensure we can use the t-distribution.
The point estimate for the difference is \(\bar{x}_1 - \bar{x}_2\).
A new standard error formula is used to reflect variation across both samples.
Aside from these two differences, the procedures are nearly identical to the one-mean case.
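That new standard error for \(\bar{x}_1 - \bar{x}_2\) is \(\sqrt{s_1^2/n_1 + s_2^2/n_2}\); a sketch with made-up summary statistics:

```r
# SE of (x_bar1 - x_bar2); the sample SDs and sizes below are hypothetical
s1 <- 5.2; n1 <- 30
s2 <- 4.8; n2 <- 28
sqrt(s1^2 / n1 + s2^2 / n2)
```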
Example 3: ESC and Heart Recovery
Does treatment using embryonic stem cells (ESCs) help improve heart function after a heart attack?
An experiment tested this question in sheep that had experienced a heart attack.
Each sheep was randomly assigned to either an ESC treatment group or a control group.
Researchers measured the change in heart pumping capacity after treatment.
A positive value indicates stronger recovery.
Our goal is to construct a 95% confidence interval for the effect of ESC treatment on heart recovery compared to the control group.
Example 3: Distribution
Example 3: Summary Statistics
y <-read.csv("stem_cell.csv")y$diff <- y$after - y$beforehead(y)
Sometimes we want to compare means across multiple groups.
We might be tempted to make pairwise comparisons — for example, comparing
Group 1 vs. Group 2
Group 1 vs. Group 3
Group 2 vs. Group 3
ANOVA: Comparing More Than Two Means
However, this approach can be misleading. With many groups, making multiple comparisons increases the chance of finding a difference by random chance, even if no real differences exist.
Instead, we use a holistic test — the Analysis of Variance (ANOVA) — to assess whether there is evidence that at least one group mean differs from the others.
In short: ANOVA saves the day by allowing us to test for overall mean differences across several groups at once.
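The inflation of false positives is easy to quantify: with \(k\) groups there are \(\binom{k}{2}\) pairwise comparisons, and assuming independent tests at \(\alpha = 0.05\):

```r
k <- c(3, 5, 10)
n_tests <- choose(k, 2)           # number of pairwise comparisons
round(1 - (1 - 0.05)^n_tests, 3)  # P(at least one false positive): 0.143 0.401 0.901
```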
Core Idea of ANOVA
We will learn a new method called analysis of variance (ANOVA) and a new test statistic called \(F\)
ANOVA uses a single hypothesis test to check whether the means across many groups are equal
\(H_0: \mu_1 = \mu_2 = ... = \mu_k\)
\(H_A:\) at least one mean is different
ANOVA Example: Exam Scores Across Classes
College departments often offer multiple sections of the same introductory course each semester to meet high student demand.
Suppose a statistics department runs three lectures (A, B, and C) of an introductory statistics course.
We want to know whether there are statistically significant differences in the average first exam scores across these three classes.
ANOVA allows us to test whether any of the class means differ, rather than relying on multiple pairwise comparisons.
ANOVA: SSG
Statisticians measure variation using squared differences. The sum of squares between groups (SSG) measures the differences between the group means \(\bar{x}_i\) and the grand mean \(\bar{x}\) like so:
\[SSG = \sum_{i=1}^{k} n_i \left(\bar{x}_i - \bar{x}\right)^2\]
The mean square between groups (MSG) is calculated using both the groups’ degrees of freedom \(df_G\) and the sum of squares between groups SSG like so:
\[MSG = \dfrac{1}{df_G} SSG\]
k <- 3
df1 <- k - 1
MSG <- SSG / df1
MSG
[1] 42.05833
ANOVA: SSE
The sum of squared errors (SSE) is the sum of the squares of the differences between each sample’s observations and each respective sample mean.
The calculation is simplified by first finding the variance \(s_i^2\) of each sample:
\[SSE = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2\]
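As a sketch with hypothetical group sizes and standard deviations (not the exam-score data), SSE can be computed from the per-group variances as \(\sum_i (n_i - 1) s_i^2\):

```r
n_i <- c(58, 55, 51)        # hypothetical group sizes
s_i <- c(13.9, 13.8, 13.1)  # hypothetical group standard deviations
SSE <- sum((n_i - 1) * s_i^2)
SSE
```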